Skip to content

[superlog] Downgrade ClickHouse timeout log from ERROR to WARN in tracking health checks#468

Open
superlog-app[bot] wants to merge 1 commit into
mainfrom
superlog/downgrade-clickhouse-timeout-log-level
Open

[superlog] Downgrade ClickHouse timeout log from ERROR to WARN in tracking health checks#468
superlog-app[bot] wants to merge 1 commit into
mainfrom
superlog/downgrade-clickhouse-timeout-log-level

Conversation

@superlog-app

@superlog-app superlog-app Bot commented Jun 11, 2026

Copy link
Copy Markdown

Summary

The isTrackingSetup dashboard endpoint occasionally logs ERROR when its ClickHouse query for blocked-traffic data exceeds the 10-second timeout. The endpoint itself recovers gracefully — it returns 200 with tracking_issue: null — so the ERROR is a false alarm that creates noisy incidents.

Both getRecentBlockedTrackingIssue and getTrackingEventsStatus share the same Promise.race timeout guard and the same mis-levelled catch block. The fix differentiates between the known timeout (logged as WARN) and genuinely unexpected errors (kept at ERROR), applied symmetrically to both functions.

An alternative approach would be to pass an AbortSignal to the ClickHouse client so the underlying query is cancelled when the race times out, rather than letting it continue running in the background — that would reduce ClickHouse load during busy periods but requires ClickHouse client support for cancellation.

Incident on Superlog


Was this PR helpful? Leave feedback — goes straight to the Superlog team.


Summary by cubic

Downgraded ClickHouse timeout logs to WARN in tracking health checks to stop false alarms. Unexpected errors remain ERROR; endpoint behavior is unchanged.

  • Bug Fixes
    • Detect the known "ClickHouse query timeout" in getTrackingEventsStatus and getRecentBlockedTrackingIssue and log WARN; keep other errors at ERROR.

Written for commit 8f429e8. Summary will update on new commits.

Review in cubic

@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dashboard Ready Ready Preview, Comment Jun 11, 2026 2:21pm
databuddy-status Ready Ready Preview, Comment Jun 11, 2026 2:21pm
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
documentation Skipped Skipped Jun 11, 2026 2:21pm

@unkey-deploy

unkey-deploy Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Unkey Deploy

Name Status Preview Inspect Updated (UTC)
api (preview) Ready Visit Preview Inspect Jun 11, 2026 2:20pm

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR downgrade-classifies ClickHouse timeout log events from ERROR to WARN in the isTrackingSetup endpoint, matching the graceful-recovery behavior already in place (both callers return safe defaults on timeout). The fix is applied symmetrically to getTrackingEventsStatus and getRecentBlockedTrackingIssue.

  • The timeout string "ClickHouse query timeout" is a bare literal that appears in four places (two Error(...) constructors and two isTimeout string comparisons) with no shared constant, so a future typo or rename in one spot would silently break the distinction and fall back to ERROR.
  • The PR description itself notes that the background ClickHouse query is not cancelled when the race times out; no code change addresses this, and ClickHouse load during busy periods remains unchanged.

Confidence Score: 4/5

Safe to merge — the change only affects log levels in two catch blocks; the observable behavior of the endpoint is unchanged.

The fix correctly demotes noisy timeout alerts to WARN while preserving ERROR for unexpected failures. The only fragility is that the timeout sentinel string is duplicated four times as a bare literal across throw sites and catch comparisons, with no shared constant to keep them in sync. A future rename in one place would not be caught by the type system and would silently break the discrimination.

packages/rpc/src/routers/websites.ts — specifically the four inline occurrences of "ClickHouse query timeout" that would benefit from a shared constant.

Important Files Changed

Filename Overview
packages/rpc/src/routers/websites.ts Downgrade ClickHouse timeout catch blocks from ERROR to WARN in both getTrackingEventsStatus and getRecentBlockedTrackingIssue; detection relies on an exact string match against "ClickHouse query timeout" with no shared constant.

Sequence Diagram

sequenceDiagram
    participant Client
    participant isTrackingSetup
    participant getTrackingEventsStatus
    participant getRecentBlockedTrackingIssue
    participant ClickHouse
    participant Logger

    Client->>isTrackingSetup: GET /isTrackingSetup
    isTrackingSetup->>getTrackingEventsStatus: call
    isTrackingSetup->>getRecentBlockedTrackingIssue: call

    getTrackingEventsStatus->>ClickHouse: chQuery (analytics.events)
    getRecentBlockedTrackingIssue->>ClickHouse: chQuery (analytics.blocked_traffic)

    alt Query completes within 10s
        ClickHouse-->>getTrackingEventsStatus: result rows
        ClickHouse-->>getRecentBlockedTrackingIssue: result rows
        getTrackingEventsStatus-->>isTrackingSetup: EventsCheckResult
        getRecentBlockedTrackingIssue-->>isTrackingSetup: BlockedTrackingIssueRow or null
    else Timeout fires (10s)
        Note over getTrackingEventsStatus,getRecentBlockedTrackingIssue: Promise.race rejects with Error("ClickHouse query timeout")
        getTrackingEventsStatus->>Logger: logger.warn (was ERROR)
        getRecentBlockedTrackingIssue->>Logger: logger.warn (was ERROR)
        getTrackingEventsStatus-->>isTrackingSetup: hasEvents false, recentEvents 0
        getRecentBlockedTrackingIssue-->>isTrackingSetup: null
        Note over ClickHouse: Query continues running in background
    else Unexpected ClickHouse error
        getTrackingEventsStatus->>Logger: logger.error (unchanged)
        getRecentBlockedTrackingIssue->>Logger: logger.error (unchanged)
    end

    isTrackingSetup-->>Client: 200 tracking_issue null
Loading

Comments Outside Diff (2)

  1. packages/rpc/src/routers/websites.ts, line 168-173 (link)

    P2 The timeout string "ClickHouse query timeout" is duplicated as a bare literal in both getTrackingEventsStatus (line 169) and getRecentBlockedTrackingIssue (line 249), and is now also matched character-for-character in two separate catch blocks. If this string is ever changed in one setTimeout call but not the other — or not in the corresponding isTimeout check — the catch will silently fall through to the ERROR branch, defeating the purpose of this fix. A shared constant would keep all four references in sync.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  2. packages/rpc/src/routers/websites.ts, line 248-253 (link)

    P2 Same duplicated literal issue in the second setTimeout — this should reference the shared constant too so both throw sites stay in sync.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "[superlog] Downgrade ClickHouse timeout ..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants